Edge-enhanced infrared image super-resolution reconstruction model under transformer

Infrared images have important applications in military, security and surveillance fields. However, limited by technical factors, the resolution of infrared images is generally low, which seriously restricts their application and development in these fields. To address the difficulty of recovering edge information and the tendency toward ringing artifacts in infrared image super-resolution reconstruction, an edge-enhanced infrared image super-resolution reconstruction model under transformer, TESR, is proposed. The main structure of this model is a transformer. First, to address the difficulty of recovering edge information in infrared images, an edge detection auxiliary network is designed, which obtains more accurate edge information from the input low-resolution images and enhances edge details during image reconstruction. Then, the CSWin Transformer is introduced to compute the self-attention of horizontal and vertical stripes in parallel, which increases the receptive field of the model and enables it to utilize features with higher semantic levels. The proposed model can extract more comprehensive image information while obtaining more accurate edge information to enhance the texture details of super-resolution images, achieving better reconstruction results.

• An edge-enhanced infrared image super-resolution reconstruction model under transformer (TESR) is proposed. The deep feature extraction module is designed by introducing residual cross-shaped windows in the SR backbone network, and the edge features extracted by the edge detection auxiliary network are additionally introduced in the deep feature extraction process to enhance the SR detail of infrared images at high magnification factors. This approach not only provides a larger receptive field but also leverages available edge information to learn and fill in missing pixels, reconstructing SR infrared images at different magnification factors.
• Aiming at the problems of insufficient extraction of detail features and lack of edge information in existing models, this paper designs a deep feature extraction module and an edge detection auxiliary network to jointly enhance texture and edge features. In the deep feature extraction module, the self-attention of horizontal and vertical stripes is computed in parallel by means of a cross-shaped window, which increases the receptive field and captures more detailed features. The edge detection auxiliary network raises the image resolution by up-sampling, which improves the quality of the edge information in the image, so that the edges extracted by the rich convolutional network are finer and more complete.
• The experiments conducted on infrared image datasets demonstrate that our proposed method exhibits superior super-resolution image reconstruction performance, particularly in the edge prominent regions, compared to methods such as SwinIR, especially at an 8 × magnification factor.

Infrared image super-resolution
The successful application of deep learning SR models to visible light images has prompted many researchers to apply them to infrared images. However, the performance of directly applying visible light SR models to infrared images is often unsatisfactory. Inspired by the SRCNN method, Choi et al. 27 proposed the thermal enhancement network (TEN), which is trained using visible spectral data to enhance the resolution of infrared images. However, the improvement achieved is quite limited due to the differences between the two spectra. He et al. 28 proposed a cascaded deep network with multiple receptive fields for infrared image super-resolution (CDNMRF). Marivani et al. 29 proposed multimodal image SR using visible images to provide auxiliary information. Zou et al. 30 explored an infrared image super-resolution reconstruction method based on a skip connection convolutional neural network, which extracts image features through convolutional layers and recovers image details through deconvolutional layers. Prajapati et al. 26 proposed a channel splitting-based convolutional neural network (ChasNet), which utilizes channel splitting to extract high-frequency features of infrared images. Gutierrez et al. 31 designed the AVRFN model by combining dilated convolutions and second-order channel attention. Yang et al. 32 proposed the spatial attention residual network (SAResNet), composed of spatial attention and residual blocks. Both networks aim to improve the accuracy of reconstructed images through attention mechanisms. In contrast, Du et al. 33 abandoned the attention mechanism and achieved high reconstruction accuracy by capturing a larger receptive field through hybrid convolutions with multi-scale residuals.

Vision transformer
The Transformer, proposed by Vaswani et al. 13 in 2017, is a model that contains no convolution and is entirely based on the attention mechanism. Given the successful applications of the Transformer in natural language processing, its potential in computer vision has gradually been explored. Dosovitskiy et al. 14 first introduced the transformer for computer vision tasks, proposing the vision transformer (ViT). ViT performs multi-head self-attention on global feature maps, achieving promising results and demonstrating the effectiveness of the transformer in the field of vision. Liu et al. 10 proposed the Swin Transformer, which employs multi-head self-attention solely within local windows to reduce computation time. Additionally, it utilizes a shifted window mechanism to enhance information interaction between windows. Dong et al. 15 introduced the core design of the CSWin Transformer, centred around the cross-shaped window self-attention (CSWSA) module. By parallelly executing self-attention for horizontal and vertical stripes within

Network architecture
The overall structure of TESR is shown in Fig. 1, and the model contains two sub-networks: the SR backbone network and the edge detection auxiliary network (EDAN), in which the SR backbone network consists of three modules: shallow feature extraction module (SFEM), deep feature extraction module (DFEM), and reconstruction module (RECM).
EDAN is composed of a Bicubic upsampling layer, a rich convolutional edge detection network RCF 16 , and a convolutional module Conv_block. EDAN performs edge extraction on the input LR image, denoted as I_LR, to obtain edge features F_E. These edge features assist the SR main network in reconstructing high-quality infrared images.
SFEM is a convolutional layer with a kernel size of 3 × 3 that extracts the shallow feature F_0 from the input I_LR.
DFEM mainly includes high-frequency texture feature extraction and edge feature fusion. The high-frequency texture feature extraction contains K residual cross-shaped window transformer blocks (RCSTB); the features F_1, F_2, …, F_{K−1}, and F_K are extracted from F_0 by the K RCSTB modules in sequence, yielding the high-frequency texture feature F_K.
After combining F_K and F_E in DFEM, the fused features undergo refinement through a 3 × 3 convolutional layer. Subsequently, the deep features F_DEF are obtained by fusing the refined features with F_0 through long skip connections (Eq. 1), where H_CONV(·) denotes a 3 × 3 convolutional layer. The long skip connections in DFEM facilitate the fusion of low-frequency and high-frequency information, allowing DFEM to focus on the extraction of high-frequency information and edge enhancement. To enhance the edge information in F_K effectively, this paper introduces a balancing factor α and overlays F_E onto F_K with adjusted intensity.
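The fusion described in this paragraph can be written compactly. The expression below is a reconstruction from the prose only, assuming that "combining" and "overlaying" denote element-wise addition with F_E scaled by α, and that the long skip connection is also an addition; the typeset Eq. 1 may differ in detail.

```latex
F_{DEF} = H_{CONV}\left(F_K + \alpha F_E\right) + F_0, \qquad 0 \le \alpha \le 1
```

Under this reading, α = 0 disables the edge branch entirely, while α = 1 adds the edge features at full intensity.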
RECM consists of sub-pixel convolution and a 3 × 3 convolutional layer, employed to reconstruct a high-quality SR image I_SR from F_DEF.
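The rearrangement at the heart of sub-pixel convolution can be illustrated without any deep-learning framework. The sketch below shows only the channel-to-space shuffle (not the preceding convolution) on nested Python lists. The function name `pixel_shuffle` and the channel ordering are assumptions that follow the convention commonly used by deep-learning libraries, not details confirmed by the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Each group of r*r input channels supplies the r x r block of output
    pixels that replaces one low-resolution pixel.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical offset inside the r x r block
            for j in range(r):      # horizontal offset inside the block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become one 2x2 channel for a scale factor of 2.
features = [[[1]], [[2]], [[3]], [[4]]]
upscaled = pixel_shuffle(features, 2)
```

For a scale factor r, the convolution before the shuffle therefore has to produce r² times as many channels as the output image needs.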

Residual cross-shaped window transformer block
This paper designs the RCSTB based on the CSWin Transformer, as illustrated in Fig. 2. The RCSTB comprises L cross-shaped window transformer layers (CSTL) and a 3 × 3 convolutional layer. The structure of CSTL is depicted in Fig. 3.
First, the L CSTLs extract features sequentially, where H_CSTL_{i,j}(·) denotes the j-th CSTL in the i-th RCSTB. Subsequently, F_{i,L} is refined using a 3 × 3 convolutional layer, and the output feature F_{i,out} is obtained by merging it with the input feature F_{i,0}, where H_CONV_i(·) denotes the 3 × 3 convolutional layer in the i-th RCSTB.
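The two steps of an RCSTB can be summarised in symbols. The following is a reconstruction from the surrounding definitions, assuming that "merging" denotes an element-wise residual addition (as the name of the block suggests); it is a sketch, not the paper's typeset equations.

```latex
\begin{aligned}
F_{i,j} &= H_{CSTL_{i,j}}\left(F_{i,j-1}\right), \qquad j = 1, 2, \ldots, L \\
F_{i,out} &= H_{CONV_i}\left(F_{i,L}\right) + F_{i,0}
\end{aligned}
```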
CSTL is based on the standard multi-head self-attention of the original transformer layer and consists of the cross-shaped window self-attention (CSWSA) module, a multi-layer perceptron (MLP), and layer normalization (LN) 17 . It makes two main improvements over the original transformer layer: CSWSA and locally-enhanced positional encoding (LePE). CSWSA is a type of multi-head self-attention, as shown in Fig. 4, which is realised by computing horizontal and vertical stripe self-attention of width sw in parallel. Consequently, under equivalent conditions, CSWSA possesses a larger receptive field than traditional window-based self-attention. In addition, important positional information usually comes from the local neighbourhood, yet self-attention is permutation-invariant, so positional information in two-dimensional images is frequently neglected. Hence, LePE is employed in the position encoding of CSWSA to compute and capture local positional information.
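The receptive-field claim can be made concrete by counting which positions a single pixel can attend to in one layer. The sketch below is illustrative only: the helper names and the 24 × 24 feature-map size are assumptions, and it models only the attention footprint (the union of the horizontal and vertical stripes that contain the pixel), not the attention computation itself.

```python
def horizontal_stripe(h, w, sw, i):
    """Positions in the horizontal stripe (sw rows, full width) containing row i."""
    top = (i // sw) * sw
    return {(r, c) for r in range(top, min(top + sw, h)) for c in range(w)}


def vertical_stripe(h, w, sw, j):
    """Positions in the vertical stripe (full height, sw columns) containing column j."""
    left = (j // sw) * sw
    return {(r, c) for r in range(h) for c in range(left, min(left + sw, w))}


def cross_shaped_field(h, w, sw, i, j):
    """Union of both stripes: the cross-shaped footprint of pixel (i, j)."""
    return horizontal_stripe(h, w, sw, i) | vertical_stripe(h, w, sw, j)


def square_window(h, w, sw, i, j):
    """Footprint of ordinary sw x sw window attention, for comparison."""
    top, left = (i // sw) * sw, (j // sw) * sw
    return {(r, c)
            for r in range(top, min(top + sw, h))
            for c in range(left, min(left + sw, w))}


# Hypothetical 24 x 24 feature map with stripe width sw = 6.
cross = cross_shaped_field(24, 24, 6, 7, 10)
window = square_window(24, 24, 6, 7, 10)
print(len(cross), len(window))
```

With these numbers the cross-shaped footprint covers 252 positions against 36 for a square window of the same width, which is the sense in which CSWSA enlarges the receptive field.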
The designed RCSTB in this paper combines the adaptive filtering characteristics of Transformers with the spatially invariant filtering properties of convolution 18 to enhance the modelling capability of TESR.

Edge detection auxiliary network
Edge information is one of the fundamental and critical features in image processing. For an image, well-defined edges enhance visual impact. Therefore, preserving or introducing certain edge information can enhance the visual appeal of super-resolution (SR) images. Currently, there is a range of traditional edge feature extraction operators, such as Canny and Sobel. Additionally, there are CNN-based edge extraction networks, such as HED 19 and RCF. The traditional edge detection operator Canny is sensitive to noise and easily identifies positions with significant grayscale changes caused by noise as edges. Moreover, the low-level edge information it yields struggles to represent high-level edge details. In contrast, RCF integrates output results from different scales, facilitating multi-scale and multi-level learning for images. This enables the model to acquire more refined high-level edge features. Figure 5 illustrates a comparison of the edge extraction effects using Canny and RCF on super-resolution images and low-resolution images with a scaling factor of × 2. The edges extracted by the Canny operator (third row) contain a significant amount of noise and lack continuity. On the other hand, the edges extracted by RCF (fourth row) represent the contours of the main objects, with less noise and better continuity. Therefore, RCF is chosen as the primary method for edge extraction in this paper.
When the image resolution is low, performing edge extraction on it not only yields limited information but also leads to poor continuity of edges, especially for small objects within the image. However, in the field of SR reconstruction, even basic interpolation algorithms like Bicubic can enhance the quality of an image, making the edge information more refined. As shown in Fig. 5, the edge extraction results are improved after using Bicubic interpolation, with the sixth row outperforming the third row, and the seventh row outperforming the fourth row. Therefore, Bicubic processing is applied before edge extraction in this paper.
This paper introduces an EDAN designed to assist in the reconstruction of SR images, as illustrated in Fig. 6. First, the input image I_LR is subjected to 2 × Bicubic up-sampling to obtain I_BLR, thereby increasing the number of edge pixels in the low-resolution infrared image. Subsequently, I_BLR is fed into RCF to extract edge information, resulting in the edge map I_E. Finally, I_E undergoes refinement through Conv_block, composed of four 3 × 3 convolutional layers and one pooling layer, to obtain the edge feature F_E. F_E will be utilized to enhance the edge information of deep features. The entire process can be expressed as follows:

$$I_{BLR} = \mathrm{BIC}(I_{LR}), \qquad I_E = \mathrm{RCF}(I_{BLR}), \qquad F_E = \mathrm{COB}(I_E)$$

where BIC(·) denotes the Bicubic module, RCF(·) denotes the RCF edge detection network, and COB(·) denotes the Conv_block.
Moreover, the low-resolution image obtained by degrading the high-resolution image may not accurately capture certain details of the original image, as exemplified by the railing portion of the LR image of test image c in Fig. 5. To avoid amplifying erroneous edge information that may affect the results of super-resolution (SR) reconstruction, this paper introduces a balancing factor α (0 ≤ α ≤ 1). Under the influence of α, the edge information of deep features is enhanced using F_E, aiming to balance the conflict between edge features and deep features during the feature fusion stage. Further discussion on α is provided in Sect. 3.4.

Experiments

Datasets
In terms of experimental data, this paper utilized the datasets 20 Thermal101 and Thermal950 for SR reconstruction of infrared images proposed by Rivadeneira et al. 21,22 in 2019 and 2020, respectively. Due to the limited quantity of image data, the current study merged and organized these two datasets, resulting in a total of 950 training images, 50 validation images, and 50 testing images. The original image was taken as the HR image, and the corresponding LR image was obtained by downsampling the HR image using Bicubic and adding an appropriate amount of Gaussian noise. To demonstrate the model's generalization, we also conducted comparative experiments with other infrared super-resolution reconstruction models using the infrared image dataset provided by Zou et al. 30 , which we simply denote as Thermal700 based on the number of images it contains.

Training details
This study employs the PyTorch framework to construct the experimental model. The hardware comprises an Intel Core i9-12900KF CPU, 64 GB of RAM, and a single NVIDIA GeForce RTX 3090 Ti GPU.
In the training phase, for EDAN, the weight parameters provided by RCF 16 are first loaded and frozen. Subsequently, the entire TESR model is trained. Before the LR images are fed into the model for training, the training data are augmented by rotation, translation, and flipping to increase their number and diversity.
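A minimal sketch of paired augmentation is given below, assuming that the same geometric transform must be applied to an LR image and its HR counterpart so that they stay aligned. Only 90-degree rotation and horizontal flipping are shown; translation is omitted, and the function names are illustrative rather than taken from the paper's code.

```python
def rot90(img):
    """Rotate a 2-D nested list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2-D nested list left to right."""
    return [row[::-1] for row in img]


def augment_pair(lr, hr, quarter_turns, flip):
    """Apply the same rotation and optional flip to both images of a pair."""
    for _ in range(quarter_turns % 4):
        lr, hr = rot90(lr), rot90(hr)
    if flip:
        lr, hr = hflip(lr), hflip(hr)
    return lr, hr


# Toy 1x2 LR image with a matching 2x4 HR image.
lr_img = [[1, 2]]
hr_img = [[1, 1, 2, 2],
          [1, 1, 2, 2]]
lr_aug, hr_aug = augment_pair(lr_img, hr_img, quarter_turns=1, flip=True)
```

Combining the four rotations with the optional flip gives eight variants of every training pair.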
Referring to the base model SwinIR, this paper sets the total number of iterations to 500,000. The L1 loss is chosen as the loss function, and Adam is employed as the optimizer with an initial learning rate of 0.0002. Learning rate decay is applied at 250,000, 400,000, 450,000, and 475,000 iterations, halving the learning rate at each decay point. The number of RCSTBs, the number of CSTLs, the width of horizontal or vertical stripes (sw), the balance coefficient α, the channel number, and the number of multi-head self-attention heads are set to 6, 6, 6, 0.1, 180, and 6, respectively. Due to hardware limitations, the batch sizes of TESR and SwinIR are set to 8 during training, while the other model hyperparameters are kept consistent with their original paper.
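The step schedule described above can be expressed as a small function. This is a sketch under one assumption not stated in the text: each halving takes effect from the milestone iteration onward (so iteration 250,000 already uses the reduced rate). The names `MILESTONES` and `learning_rate` are illustrative.

```python
from bisect import bisect_right

# Decay points and initial rate as stated in the training details.
MILESTONES = (250_000, 400_000, 450_000, 475_000)


def learning_rate(iteration, base_lr=2e-4, milestones=MILESTONES, gamma=0.5):
    """Return the learning rate in effect at a given iteration.

    The rate is multiplied by gamma once for every milestone that has
    been reached (assumed inclusive of the milestone iteration itself).
    """
    passed = bisect_right(milestones, iteration)
    return base_lr * gamma ** passed


for it in (0, 249_999, 250_000, 400_000, 450_000, 475_000, 499_999):
    print(it, learning_rate(it))
```

After the fourth decay point the rate is one sixteenth of its initial value, that is 1.25e-5 for the final 25,000 iterations.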

Evaluation metrics
In this paper, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are chosen as the evaluation metrics, assessed on the Y channel after conversion to the YCbCr color space. The expressions for PSNR and SSIM are as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{M^{2}}{\mathrm{MSE}}\right) \tag{11}$$

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)} \tag{12}$$

In Eq. 11, M represents the maximum grayscale value of the image pixels, typically set to 255. MSE denotes the mean square error between the reconstructed image and the high-resolution image. A higher PSNR indicates lower distortion and better quality of the generated image. In Eq. 12, μ_x and μ_y represent the means of the reconstructed image x and the high-resolution image y, respectively. σ_x² and σ_y² denote the variances of the reconstructed image x and the high-resolution image y, respectively. σ_xy is the covariance between the reconstructed image x and the high-resolution image y. c_1 and c_2 are constants. The SSIM value increases with the similarity between the reconstructed image x and the high-resolution image y: a higher SSIM value indicates greater similarity between x and y. When the two images are identical, the SSIM value is 1.
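Both metrics are straightforward to compute for small examples. The sketch below operates on flat lists of Y-channel pixel values and makes two simplifying assumptions that are not taken from the paper: SSIM is evaluated once over the whole image rather than averaged over local windows, and the constants are set to the conventional choices c_1 = (0.01 M)² and c_2 = (0.03 M)².

```python
import math


def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val ** 2 / mse)


def global_ssim(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (simplified, no local windows)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [52.0, 60.0, 61.0, 200.0]
reconstructed = [50.0, 63.0, 61.0, 190.0]
print(psnr(reconstructed, reference), global_ssim(reconstructed, reference))
```

Published SSIM figures are normally computed with a sliding local window, so values from this simplified version are not directly comparable to those in the result tables.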
Tables 1, 2, and 3 present the PSNR and SSIM results on the test set under scaling factors of × 2, × 4, and × 8, respectively. In terms of quantitative metrics, our model exhibits varying degrees of improvement compared to the traditional Bicubic method and various deep learning-based methods. Specifically, the PSNR improvement ranges from 1.8017 dB to 4.1446 dB at scaling factors of × 4 and × 8, and the SSIM improvement ranges from 0.0111 to 0.0447. Compared to other deep learning-based methods, TESR has slightly lower PSNR and SSIM than HAT when the scaling factor is × 2. When the scaling factor is × 4, TESR exhibits overall superior performance in both PSNR and SSIM. When the scaling factor is × 8, TESR performs even better on both metrics. These results indicate that TESR consistently achieves the optimal or second-best values on the test dataset.
The quantitative metrics demonstrate the superiority of the proposed method. To further demonstrate the effectiveness of TESR from a visual perspective, this paper selected three images from the test datasets to showcase the reconstruction results of the comparative methods at a scaling factor of × 8, as illustrated in Fig. 7. To display the edge enhancement effect of TESR on reconstructed images more intuitively, we cropped portions of the reconstructed images containing long and slender objects for demonstration. The cropped sections are highlighted with red boxes. Moreover, for other scenes in the test images, the reconstruction performance of TESR is comparable to or even superior to that of the other methods. Due to the large scaling factor, the visual quality of the images reconstructed by most comparative methods is notably poor, exhibiting severe blurring artifacts. In contrast, TESR can restore more high-frequency details, achieving superior reconstruction results that closely resemble the original high-definition image (GT). For test image 1, most of the comparison methods are unable to reconstruct the railing portion, and some of them even fail to discern the information conveyed in the image. Only TESR is capable of reconstructing an image close to reality. In test image 2, earlier models such as SRGAN, ESRGAN, and VDSR produce reconstructed images that lose the main structure. On the other hand, recent methods like SwinIR and HAT can reconstruct the main contours but may not restore more detailed image edge information. Only TESR achieves more refined results. In test image 3, the other comparative methods fail to restore the clear structure of the soccer net, whereas the TESR model performs well in this regard.
When the scaling factor is extremely large, the information contained in the low-resolution (LR) infrared images becomes highly limited, making it challenging for super-resolution (SR) methods to reconstruct valuable results. Most comparative methods perform poorly at high magnification factors. However, the proposed TESR can acquire and utilize more useful information, leading to better reconstruction results.
To further validate the effectiveness of the model on infrared images, we conducted a comprehensive comparison between several state-of-the-art infrared SR methods and the proposed TESR, including TEN 27, CDNMRF 28, and AVRFN 31. The above methods were tested on three different datasets, as shown in Table 4. Across all scaling factors, TESR outperforms the other methods by a wide margin in all metrics.

Ablation experiment
To verify the influence of incorporating edge information on the performance of infrared image super-resolution reconstruction, this paper takes SwinIR as the base model and compares two variants: EDAN with a randomly initialized RCF (SwinIR + NP-EDAN) and EDAN with an edge pre-trained RCF (SwinIR + EDAN). The comparative results are shown in Table 5. It can be seen from Table 5 that when the magnification factor is × 4, the "SwinIR + NP-EDAN" model exhibits a performance decline compared to the baseline SwinIR model. This is attributed to the fact that the randomly initialized parameters of the RCF do not provide additional edge information during the training process and may extract some irrelevant features, thereby affecting the performance of super-resolution reconstruction. However, the "SwinIR + EDAN" model, by virtue of fixing the pre-trained RCF weights, reliably provides additional edge information, thereby enhancing edge features and improving the performance of infrared image super-resolution reconstruction. Thus, it can be concluded that introducing extra edge information can enhance the effectiveness of super-resolution reconstruction.
To substantiate the positive impact of CSTL and EDAN on TESR, this paper conducts ablation experiments on CSTL and EDAN with SwinIR as the base model while keeping the model parameters consistent.The results are shown in Table 6.
From Table 6, it can be observed that at scaling factors × 2, × 4, and × 8, SwinIR achieves PSNR values of 45.8080 dB, 35.3988 dB, and 29.9686 dB, respectively, with corresponding SSIM values of 0.9914, 0.9276, and 0.8378. When CSTL is used alone, compared to the base model, the PSNR improves by 0.0636 dB, 0.1699 dB, and 0.1533 dB at the three scaling factors, while the SSIM improves by 0.0001, 0.0023, and 0.0042, respectively. This is because CSTL has a larger receptive field, allowing the model to leverage deeper-level features and consequently achieve better reconstruction results. When EDAN is added independently at the three scaling factors, although it introduces some additional parameters, the performance improvement is noticeable. The PSNR increases by 0.4296 dB, 0.0871 dB, and 0.0211 dB, while the SSIM increases by 0.0008, 0.0012, and 0.0007, respectively. The primary reason is that EDAN can enhance image edge information to help image reconstruction. When CSTL and EDAN are applied simultaneously, there is a further improvement in objective metrics. Because it is less

Parameter discussion
In CSWSA, the width (sw) of horizontal or vertical stripes is closely related to the size of the receptive field. Table 7 demonstrates the impact of sw on model performance at a magnification factor of × 4, revealing a positive correlation between PSNR, SSIM, and sw. This is because a wider stripe width can increase the model's receptive field, alleviating the issue of information loss due to the depth of the network. To balance learning capacity and computational complexity, sw is set to 6 in this paper.
In the process of SR reconstruction, the intensity of edge features can impact the reconstruction results, necessitating a discussion on the parameter α. In this paper, α is sequentially set to 0, 0.01, 0.1, and 1. Table 8 demonstrates the influence of α on model performance at a magnification factor of × 4. From the table, it is evident that when α is set to 0.1, both PSNR and SSIM reach their maximum values. Therefore, α is set to 0.1 in this paper.
Regarding the total model parameter count, although stacking RCSTB blocks can enhance the network's modeling capability, the improvement diminishes once a certain threshold is reached. Therefore, we discuss the impact of the parameter k, the number of RCSTB blocks, on the parameter count and model performance. Table 9 shows the impact of k on model performance at a scaling factor of × 4. From the table, it can be observed that performance improvements are marginal beyond k = 4, while the parameter count continues to grow with each added block. Therefore, k is set to 4 in this paper.

Conclusion
In this paper, we introduce a transformer-based edge-enhanced super-resolution model (TESR) for infrared image super-resolution reconstruction tasks. The CSWin Transformer layer in this model exhibits excellent long-range contextual modelling and global information capture capabilities. It can compute self-attention for horizontal and vertical stripes in parallel, achieving improved reconstruction results without increasing computational complexity. Additionally, the proposed edge detection auxiliary network can extract fine-grained edge information. Using this edge information as supplementary data enhances the edges of the reconstructed infrared images. The experimental results indicate that our model outperforms current representative methods in terms of objective evaluation metrics, including PSNR and SSIM. In terms of subjective visual effects, our model demonstrates the ability to recover more high-frequency details, resulting in images with clearer edges. It is worth noting that because of the use of the transformer as the main architecture and the introduction of an edge detection auxiliary network, the computational and parameter complexity of the network is relatively high. Future work will focus on optimizing the model to address this issue, enabling a lightweight version that maintains superior performance.

Figure 5. Comparison of low-resolution images (LR) and super-resolution images (LR_Bicubic processing) and the effect of edge extraction using Canny and RCF respectively.

Figure 7. Comparison of visual effects of different methods at × 8 scale factor.

SwinIR 9 consists of three components: a shallow feature extraction module, a deep feature extraction module, and a reconstruction module. The shallow feature extraction module employs 3 × 3 convolutional layers to extract shallow features. The deep feature extraction module is primarily composed of multiple RSTBs (residual Swin Transformer blocks) and a 3 × 3 convolutional layer for feature enhancement. Each RSTB utilizes multiple STLs (Swin Transformer layers) for local attention and interaction across different receptive fields, thereby enhancing the model's expressive capabilities. The reconstruction module integrates both shallow and deep features for image reconstruction. SwinIR integrates the features of CNN and transformer to achieve good experimental results on typical super-resolution datasets, which demonstrates the effectiveness of applying transformer to low-level visual tasks.

Table 1. Reconstruction results with a scale factor of × 2. Significant values are in bold.

Table 2. Reconstruction results with a scale factor of × 4. Significant values are in bold.

Table 3. Reconstruction results with a scale factor of × 8. Significant values are in bold.

Table 4. Quantitative comparison of various infrared SR methods on different datasets. Significant values are in bold.

Table 5. Reconstruction performance comparison of SwinIR introducing pre-trained RCF and non-pretrained RCF. Significant values are in bold.

difficult to reconstruct at a scaling factor of × 2, there is no significant difference between the ablation results. We list the results for scaling factors of × 4 and × 8 in Fig. 8. It can be clearly seen that using CSTL and EDAN simultaneously can reconstruct clear contours without distortion.

Table 6. Results of ablation experiments. Significant values are in bold.

Figure 8. Visual comparison of ablation experiments at scaling factors of × 4 and × 8.

Table 7. Results of sw size discussion. Significant values are in bold.

Table 8. The results of the discussion on the value of α. Significant values are in bold.

Table 9. The results of the discussion on the value of k. Significant values are in bold.